
arthritis 

researcnStnerapy 




==j 



pi 







Clinical Status 








CG 


OA 


RA 


Sum 




CG 


TN 

19 


FNoa 
0 


FN RA 
0 


19 


ssificai 


OA 


FPoa 

1 


26 


FPoa 

1 


28 


O 


RA 


FP RA 
0 


FPr a 
0 


TPra 
32 


32 




Sum 


20 


26 


33 


79 




H ts 


- 


26 


32 


77 




Errors 


1 


0 


1 


2 



Identification of rheumatoid arthritis and 
osteoarthritis patients by transcriptome-based 
rule set generation 

Woetzel et al. 



BioMed Central 



Woetzel et al. Arthritis Research & Therapy 2014, 16:R84 
http://arthritis-research.eom/content/16/2/R84 



Woetzel et al. Arthritis Research & Therapy 2014, 16:R84 
http://arthritis-research.eom/content/16/2/R84 



m* arthritis 

mmm researcnStnerapy 



RESEARCH ARTICLE Open Access 



Identification of rheumatoid arthritis and 
osteoarthritis patients by transcriptome-based 
rule set generation 

Dirk Woetzel 1 , Rene Huber 2 ' 3 , Peter Kupfer 4 , Dirk Pohlers 2 ' 5 , Michael Pfaff 1 ' 6 , Dominik Driesch 1 , Thomas Haupl 7 , 
Dirk Koczan 8 , Peter Stiehl 9 , Reinhard Guthke 4 and Raimund W Kinne 2 * 



Abstract 

Introduction: Discrimination of rheumatoid arthritis (RA) patients from patients with other inflammatory or 
degenerative joint diseases or healthy individuals purely on the basis of genes differentially expressed in 
high-throughput data has proven very difficult. Thus, the present study sought to achieve such discrimination 
by employing a novel unbiased approach using rule-based classifiers. 

Methods: Three multi-center genome-wide transcriptomic data sets (Affymetrix HG-U133 A/B) from a total of 79 
individuals, including 20 healthy controls (control group - CG), as well as 26 osteoarthritis (OA) and 33 RA patients, 
were used to infer rule-based classifiers to discriminate the disease groups. The rules were ranked with respect to 
Kiendl's statistical relevance index, and the resulting rule set was optimized by pruning. The rule sets were inferred 
separately from data of one of three centers and applied to the two remaining centers for validation. All rules from 
the optimized rule sets of all centers were used to analyze their biological relevance applying the software Pathway 
Studio. 

Results: The optimized rule sets for the three centers contained a total of 29, 20, and 8 rules (including 10, 8, and 4 
rules for 'RA'), respectively. The mean sensitivity for the prediction of RA based on six center-to-center tests was 
96% (range 90% to 100%), that for OA 86% (range 40% to 100%). The mean specificity for RA prediction was 94% 
(range 80% to 100%), that for OA 96% (range 83.3% to 100%). The average overall accuracy of the three different 
rule-based classifiers was 91% (range 80% to 100%). Unbiased analyses by Pathway Studio of the gene sets obtained 
by discrimination of RA from OA and GG with rule-based classifiers resulted in the identification of the 
pathogenetically and/or therapeutically relevant interferon-gamma and GM-GSF pathways. 

Conclusion: First-time application of rule-based classifiers for the discrimination of RA resulted in high performance, 
with means for all assessment parameters close to or higher than 90%. In addition, this unbiased, new approach 
resulted in the identification not only of pathways known to be critical to RA, but also of novel molecules such as 
serine/threonine kinase 10. 



Introduction 

Rheumatoid arthritis (RA) and osteoarthritis (OA) are the 
most common forms of arthritis [1]. In spite of different 
pathogeneses, these arthritides exhibit phenotypic simi- 
larities and overlapping cellular and molecular characteris- 
tics [1,2]. RA is a progressive, chronically inflammatory, 
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destructive joint disease of still unknown etiology, perpet- 
uated by an invasive synovial membrane (also known as 
pannus tissue) [3]. Various activated or semi-transformed 
cell types in the synovial membrane (monocytes/macro- 
phages, osteoclasts, T cells and B cells, dendritic cells and 
endothelial cells, synovial fibroblasts) contribute to the de- 
velopment and progression of RA by secretion of proin- 
flammatory cytokines and tissue-degrading proteases [4,5]. 
Similarly, OA is characterized by progressive destruction 
of cartilage and bone and dysregulation of synovial func- 
tion [6]. OA arises from the damage of articular cartilage 
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induced by physical injury and is subsequently influenced 
by a variety of intrinsic (for example, genetic, cellular, or 
immunologic) factors [7]. The OA synovial membrane 
also shows an inflammatory component, although clearly 
less pronounced than in RA [2,7]. 

Compatible with these similarities, the synovial tissue of 
OA and RA patients contains mesenchymal precursor cells 
and attempts to regenerate damaged cartilage and subchon- 
dral bone in the adult organism. In contrast to fetal healing, 
however, the synovial tissue may require inflammation to 
sustain and control the fibroproliferation [8]. 

Although these overlapping features have led to the de- 
velopment of pharmacological or surgical therapies effect- 
ive in both diseases [9-12], the similarities at the same 
time impede a reliable discrimination of the two arthriti- 
des. Diagnostic methods classically include radiography 
[13], histopathological assessment of synovitis [14], detec- 
tion of rheumatic nodules, selected laboratory values such 
as rheumatoid factor and citrullinated peptides [15,16], 
and evaluation of the patients' individual and family his- 
tory [17]. Recently, an improved ultrasound-based scoring 
system has also been proposed [18]. In general, American 
College of Rheumatology criteria for RA [15,19] or for OA 
[16] are often used for diagnostic purposes, although they 
were originally intended as classification criteria, for ex- 
ample, for the comparison of cohorts in different clinical 
studies [20]. However, an appropriate discrimination of 
RA and OA is particularly difficult at later stages of the 
diseases, and the recent revision of the respective criteria 
has not significantly improved their diagnostic capability 
[20]. For instance, the presence of rheumatoid factor as a 
marker for RA has been questioned due to its high vari- 
ability and should be replaced by the level of anti- 
citrullinated protein antibodies [21]. 

An easier discrimination of different forms of arthritis 
has been attempted by molecular approaches, in particu- 
lar, disease-specific gene expression profiling. These at- 
tempts have partially focused on the expression of 
selected candidate molecules with a known influence on 
the respective diseases; for example, type I interferon 
family members [22,23], tumor necrosis factor superfam- 
ily and bone morphogenetic protein family members 
[24], citrullinated synovial proteins [25], and proteases 
such as metalloproteinases or cathepsins [26]. Although 
these studies have indicated the existence of individual 
or combined biomarkers for RA, the validity of this ap- 
proach has not been universal. Some of the studies have 
succeeded in discriminating RA from normal controls, 
but not from other arthritides, while other studies have 
successfully discriminated RA from other forms of arth- 
ritis (such as spondylopathy or psioriatic arthritis), but 
not from OA [24]. 

In parallel to candidate gene analyses, broader, un- 
biased genome-wide gene expression profiles [27] have 



been used to identify disease-specific signatures and hid- 
den biomarkers in rheumatology with microarray-based 
methods [28]. This has been applied to discriminate 
early versus late RA [29] and to discriminate RA versus 
OA [30,31]. In addition, differentially expressed genes 
have been successfully used to predict the response of 
RA patients to therapeutic approaches, for example, the 
capability of certain (type I interferon-responsive) genes 
to predict rituximab nonresponders [32] and anti-tumor 
necrosis factor nonresponders [33] or to define homoge- 
neous subgroups within a heterogeneous disease such as 
RA [22]. However, most studies were not designed to 
identify gene expression patterns as a potential diagnos- 
tic tool, but rather to elucidate the underlying transcrip- 
tional networks [34]. The validity of the identified genes 
as markers for RA or OA was generally also not validated 
in replication cohorts. Finally, differentially expressed re- 
spectively regulated genes or pathways common to RA 
and OA remain a major challenge [30]. 

These obstacles may be overcome using microarray data 
from several analytic centers to identify sets of differen- 
tially expressed genes for the reliable diagnosis of different 
arthritides. For this purpose, bioinformatic methods 
suitable to process and interpret the large amounts of 
high-dimensional data, and also algorithms for the identi- 
fication of rules concerning the expression of disease- 
specific genes, are of utmost importance [35]. 

In personalized medicine and theranostics, the gener- 
ation of decision rules is a well-established method for 
the design of clinical decision support systems and/or 
for the discovery of relevant relationships among patho- 
genetically relevant genes in large databases [36,37]. This 
approach is intended to identify strong rules using differ- 
ent measures of so-called interestingness, for example, 
specificity for a certain disease entity. To select interest- 
ing rules from the set of all possible rules, constraints on 
various measures of significance can be used, such as 
thresholds on support and confidence. In our hands 
[38], the relevance index introduced by Kiendl and co- 
workers [39-43] is able to generate robust rule sets with 
high predictive strength from data of high dimension 
(for example, number of genes) but of low sample num- 
ber. A deterministic decision rule R r is defined by IF P r 
(y) THEN C r \ where P r describes a premise evaluating 
the observations y (that is, the enhanced expression of a 
given gene) and C r is the set of possible conclusions (for 
example, the prediction of a disease status of a given in- 
dividual). In the present work, C r is a categorical variable 
defined by the set of three clinical states {'CG' - control 
group, 'RA - rheumatoid arthritis, 'OA - osteoarthritis} 
and each premise P r is defined by the expression of only 
one gene (uniconditional rules). 

This rule-oriented approach may represent a more 
suitable alternative to the widely used identification of 
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differentially expressed genes to generate a sorted list of 
candidate genes of interest. The approach thus combines 
three major advantages: i) by avoiding the application of 
differentially expressed genes, it is more robust in its dis- 
criminative capacity to data heterogeneity among different 
donors or patients; ii) due to separate normalization and 
independent rule set generation, it is capable of eliminat- 
ing center-specific effects, thus yielding higher sample 
sizes in study cohorts; and iii) cross-validation among dif- 
ferent clinical centers is possible, independently of individ- 
ual differentially expressed genes. 

In this study, three multicenter genome-wide tran- 
scriptomic datasets from 79 individuals were used to 
infer rule-based classifiers to discriminate RA, OA, and 
healthy controls. The rule sets were inferred separately 
from one center and were applied to the other centers 
for validation. This novel approach resulted in high per- 
formance (close to 90% for specificity, sensitivity, and ac- 
curacy) for the discrimination of RA. Unbiased analysis 
of the biological relevance of the underlying rules by 
Pathway Studio (Elsevier, Munich, Germany) and gene 
enrichment analysis succeeded in identifying pathways 
with pathogenetic or therapeutic relevance in RA. 

Materials and methods 

Patients 

Synovial membrane samples were obtained either from 
postmortem joints/ traumatic joint injury cases (control 
group (CG); n = 15 and n = 5, respectively) or from RA/OA 
patients (all Caucasian) upon joint replacement/synovect- 
omy at the Jena University Hospital, Chair of Orthopedics, 
Waldkrankenhaus 'Rudolf Elle,' Eisenberg, Germany (n = 33, 
dataset 'Jena), at the Department of Orthopedics/Institute 
of Pathology/Department of Rheumatology and Clinical 
Immunology, Charite-Universitatsmedizin Berlin (n = 30, 
dataset 'Berlin'), and at the Department of Orthopedics/ 
Institute of Pathology, University of Leipzig (n = 16, dataset 
'Leipzig'). After removal, tissue samples were frozen and 
stored at -70°C. 

The study was approved by the respective ethics 
committees (Jena University Hospital: Ethics Committee 
of the Friedrich Schiller University Jena at the Medical 
Faculty; Charite-Universitatsmedizin Berlin: Charite 
Ethics Committee; and University of Leipzig: Ethics 
Committee at the Medical Faculty of the University of 
Leipzig) and informed patient consent was obtained. RA 
patients were classified according to the American 
College of Rheumatology criteria valid in the sample 
assessment period [15], OA patients were classified 
according to the respective criteria for OA [16]. The 
patients/donors were assigned to one of the three terms 
(categorical values): 'CG', 'RA, or 'OA (for clinical 
characteristics of the donors/patients, see Table 1). 



Data 

Data for 79 patients/donors were obtained from three 
clinical groups located in Jena, Berlin, and Leipzig, re- 
spectively, as presented in Table 2. 

Isolation of total RNA 

Tissue homogenization, total RNA isolation, and treat- 
ment with RNase-free DNase I (Qiagen, Hilden, Germany) 
were performed as described previously [44]. 

Microarray analysis 

Gene expression was analyzed using HG-U133 A/B RNA 
microarrays (Affymetrix, Santa Clara, CA, USA) for the 
datasets 'Jena , 'Berlin, and 'Leipzig' - a total of 79 micro- 
arrays. Labeling of RNA probes, hybridization, and wash- 
ing were carried out according to the supplier's 
instructions. Microarrays were analyzed by laser scanning 
(Gene Scanner; Hewlett-Packard, Palo Alto, CA, USA). 

Pre-processing of microarray data 

Gene expression data were pre-processed by MAS5.0 
(Affymetrix Microarray Suite). The data are accessible 
through Gene Expression Omnibus series [GSE:55235] 
(Haeupl; 'Berlin' data), [GSE:55584] (Stiehl; 'Leipzig' 
data), and [GSE:55457] (Kinne; 'Jena' data). 

For the study group 'Jena_ali; all probe sets independent 
of their Affymetrix 'present call' were used for further ana- 
lysis. For the study groups 'Jena, 'Berlin', 'Leipzig', and 'Total', 
further analyses were restricted to those genes qualified by a 
'present call' in all samples of the respective study group (as 
calculated by MAS 5.0). The data were separately normal- 
ized for the three different study groups 'Jena, 'Berlin', and 
'Leipzig' by dividing the gene expression signals for a given 
gene i and sample/patient ; by the median over all probe sets 
in this sample and were subsequently logarithmized (log 2 ), 
yielding the values y i; % By performing completely independ- 
ent normalization and rule set generation (see Rule set gen- 
eration) in the three different clinical datasets, potential 
problems related to differences in sample preparation and 
wet laboratory conditions were avoided [45]. 

Clustering 

The data were separately clustered for each probe set 
(gene) using a modified fuzzy C-means algorithm and two 
clusters. Here, the fuzzy C-means algorithm [46] was ap- 
plied for the normalized and logarithmized (log 2 ) gene 
expression data (y i; ) of a given gene for every patient be- 
longing to the respective group (that is, 'Jena_all', 'Jena', 
'Berlin', 'Leipzig', or 'Total') to estimate membership de- 
grees (Mijfc) ranging from 0 to 1 for unequivocal assign- 
ment to one of the groups 'low' or 'high' gene expression. 
The centers (CT ik ; CT a < CT i2 ) of the respective gene ex- 
pression clusters (CL ik , k=l for the cluster labeled 'low' 
and k = 2 for that labeled 'high') were also estimated. 
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Table 1 Clinical characteristics of the patients at the time of synovectomy/sampling 



Patients 
(total number) 



Gender Age 
(male/female) (Years) 



Disease RF 
duration (years) (+/-) 



ESR CRP a 
(mm/1 hour) (mg/l) 



Number of ARA- Concomitant 
criteria (RA) medication (n) 



Control group 

(n = 20) 15/5 



Osteoarthritis 

(n = 26) 4/22 

Rheumatoid arthritis 

(n = 33) 8/25 



54.7 ± 4.0 0.3 ± 0.3 d 
(n.d. = 13) 



71.0 ±1.4 7.0 ±1.3 
(n.d.= 1) 

57.0 ±2.7 12.5 ±2.0 



n.d. 



n.d. 



n.d. 



3/18 22.4 ±2.7 

(n.d. = 5) (n.d. = 5) 

21/7 42.7 ±4.5 

(n.d. = 7) (n.d. = 10) 



5.3 ±1.5 0.2 ±0.1 
(n.d. = 3) 

21. 4 ±4.1 5.2 ±0.3 
(n.d. = 3) 



NSAIDs (n=1) 
None (n = 7) 
(n.d. = 12) 

NSAIDs (n=16) 
None (n = 10) 

Prednisolone (n = 23) 
Methotrexate (n = 1 8) 
Sulfasalazine (n = 5) 
Chloroquine (n = 2) 
Leflunomide (n = 2) 
Cyclosporine (n= 1) 
Gold (n=1) 
NSAIDs {n = 22) 



For the parameters age, disease duration, ESR, CRP, and number of ARA criteria (RA), means ± standard error of the mean are given; for the remaining parameters, 

numbers are provided. ARA, American Rheumatism Association (now American College of Rheumatology); n.a., not applicable; n.d., not determined; NSAID, 

nonsteroidal anti-inflammatory drug; RA, rheumatoid arthritis; RF, rheumatoid factor; ESR, erythrocyte sedimentation rate; CRP, C-reactive protein. 

a Normal range, <5 mg /I. 

b Disease duration in joint trauma patients. 



Subsequently, a modified membership degree was used 



(At 



ijk ; 



with Mi 



1 and M, 



i;2 



0 if y iJ <CT il ; with 



Miji ' = 0 and M ifl ' = 1 if y i} > CT i2 ; with M ijk ' = M ijk other- 
wise; that is, for all data in between the two centers). 

Rule set generation 

First, all uniconditional rules were generated independently 
for the three different clinical study groups 'Jena, 'Berlin', 
and 'Leipzig' using the formula IF the premise P r is fulfilled 
THEN the conclusion C r is reached'. The premise P r is de- 
fined as follows: the expression of gene i belongs to either 
the cluster labeled 'low' (CL a ) or the cluster labeled 'high' 
(CL i2 ). The three possible conclusions (C r ; that is, in the 
present study the prediction of the clinical status) are 'CG' 
(that is, no 'RA', no 'OA'),'RA', or 'OA'. 

These rules were ranked using the relevance index RI r in- 
troduced by Kiendl and others [39-43]. Here, a rule 'IF P r 
THEN Cr is ranked on the basis of RI r . In this case, RI r rep- 
resents the normalized gap between the confidence interval 

Table 2 Number of clinical samples and transcriptome datasets 



of the conditional probability of the conclusion C r under 
the premise P r and the confidence interval of the (uncondi- 
tional) probability of the conclusion C„ as described in 
Additional file 1. The calculation of the confidence interval 
was done using a significance level alpha s with a default 
value 0.95, and a reduced alpha s for 'Jena, 'Berlin', and 
'Leipzig' in order to generate a sufficient number (>3) of 
rules with RI r > 0 for each conclusion ('CG', 'RA', or 'OA'). 
Next, it was checked and confirmed that alpha s > alpha SrSLn - 
dom> where at least one rule was generated for each of the 
three conclusions using original pre-processed gene expres- 
sion values jip and a random assignment to the individual 
conclusions ('CG', 'RA', and 'OA') in the training set. 



Rule set pruning 

As a result of the primary rule set generation, a ranked 
set of r max (C, S) rules was generated using the criterion 
RI r > 0. 



Study group S 



Control 



Osteoarthritis 



Rheumatoid arthritis 



Total 



Microarray platform 3 



'Jena' 


10 


10 


13 


33 


Affymetrix HG-U133 A 


'Berlin' 


10 


10 


10 


30 


Affymetrix HG-U133 A 


'Leipzig' 


0 


6 


10 


16 


Affymetrix HG-U133 A 


Total' 


20 


26 


33 


79 





a From Affymetrix, Santa Clara, CA, USA. 
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Rule set pruning was then applied in order to 
minimize the numbers of both rules (r opt ) and 'Errors' 
(that is, false assignment to one of the three conclusions; 
for more detail see Application of the rule sets and 
Evaluation of a rule set). The number of rules in each 
rule set was optimized by greedy search with the follow- 
ing constraints: the numbers r opt (C, S) have to be at least 
4 for each conclusion and not higher than the double of 
the minimum number of rules in any of the respective 
rule sets for the three conclusions - that is, r opt {C, S) > 4 
and r opt {C, S) < T mm c {r max {C, S)). 

The purpose of this step was also to generate rule sets 
with a balanced number of rules for the three conclusions. 

Application of the rule sets 

The rule sets for the different conclusions were then ap- 
plied to each sample (patient) ; by voting in order to 
achieve an individual prediction of its clinical status. 

First, each rule IF P r THEN C/ with the premise P r 
(P r = 'the expression y of gene i is assigned to cluster k 
(i.e., "low" or "high")') was weighted by application of 
the aforementioned fuzzy membership degree (W r; = M^ 



'(yij)) to the sample ; (see earlier Clustering). These 
membership weights (W r f, range from 0 to 1, with 1 in- 
dicating an unequivocal prediction of the conclusion C r ) 
were visualized in a heat map for all samples (j) and all 
rules (Figures 1, 2, 3, 4, 5A,B of the respective study 
group). 

Next, the weights WfiCG), WfiOK), and W}('RA') for 
each individual sample ; were calculated by summing up 
the respective membership weights (W r j) over all rules 
(r) belonging to the rule set for a given conclusion. 

Finally, the highest weight was used for the predic- 
tion of the clinical status of each sample (so-called 
'defuzzification'): 

C pre dict_ ; = argmax(WyeCG'), W)('(M'), W,fRA')) 

This procedure is used for prediction of the clinical 
status in both the original training set (y t j) from a given 
study group (for example, 'Jena) and all subsequently an- 
alyzed test sets from other study groups (for example, 
'Berlin and 'Leipzig'). 
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Figure 1 Heatmaps and confusion matrix for the study group 'Jena_all'. Data for the study group 'Jena_all' (that is, utilizing all probe sets) 
were obtained using the Jena patients (10 control group (CG), 10 osteoarthritis (OA), 13 rheumatoid arthritis (RA)) as the training set for the rule 
generation and re-applying the respective rules to the same dataset. (A) Heatmap of the membership weights applying all rules of the primary 
rule set (a = 0.95; 'CG', 45 rules; 'OA', seven rules; 'RA', 27 rules) for the prediction of the clinical status of the different samples; dashed red lines 
indicate the lower limits of the respective pruned lists of rules subsequently applied in (B). (B) Heatmap of the membership weights applying 
pruned lists of rules (a = 0.95; 'CG', 'OA', and 'RA', seven top-ranked rules each) for optimized prediction of the clinical status of the different samples. 
(C) Confusion matrix for the rule set displayed in heatmap (B). TP, true positives; TN, true negatives; FP, false positives; FN, false negatives. 
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Figure 2 Heatmaps and confusion matrix for the study group 'Jena'. Data for the study group 'Jena' (that is, utilizing only the probe sets 
with MAS 5.0 present calls in all samples) were obtained using the Jena patients (10 control group (CG), 10 osteoarthritis (OA), 13 rheumatoid 
arthritis (RA)) as the training set for the rule generation and re-applying the respective rules to the same dataset. (A) Heatmap of the membership 
weights applying all rules of the primary rule set (a = 0.94; 'CG', 31 rules; 'OA', 10 rules; 'RA', 20 rules) for the prediction of the clinical status of the 
different samples; dashed red lines indicate the lower limits of the respective pruned lists of rules subsequently applied in (B). (B) Heatmap of the 
membership weights applying pruned lists of rules (a = 0.94; 'CG', nine top-ranked rules; 'OA', 10 top-ranked rules; 'RA', 10 top-ranked rules) for 
optimized prediction of the clinical status of the different samples. (C) Confusion matrix for the rule set displayed in heatmap (B). TP, true 
positives; TN, true negatives; FP, false positives; FN, false negatives. 



Evaluation of a rule set 

Comparing the predicted conclusions (C_ pre dictj) with 
the observed clinical status (Dj), the numbers of true 
positives {TP), true negatives (77V), false positives (FP) 
and false negatives (FN) were counted individually for 
the three states ('CG\ 'OA', *RA) to set up the confusion 
matrix. The sum of the TP and TN over the three states 
gives a number called 'Hits' and the sum of FN and FP a 
number called 'Errors . The total sum (n- TP+ TN + FP + 
FN) equals the number of samples. 

The following measures were calculated to assess the 
quality of the classification: 

Sensitivity for the classification of RA = TP RA I (TP RA + 
FN R a + FP OA ); all values derived from the column 
clinical status RA in the respective confusion matrix 
Sensitivity for the classification of OA = TP OA I (TP OA + 
FNqa + FPra)'> all values derived from the column clinical 
status OA in the respective confusion matrix 



Specificity for the classification of RA = TN RA I (TN RA + 
FP RA ); with TNra = TN + FN OA + TP OA + FP OA (latter 
value derived from the column clinical status CG) and 
with the value for FP RA representing the sum of the two 
corresponding fields in the columns clinical status CG 
and OA of the respective confusion matrix 
Specificity for the classification of OA = TN OA I (TN OA + 
FP OA ); with TNqa = TN + FN RA + TP RA + FP RA (latter 
value derived from the column clinical status CG) and 
with the value for FP OA representing the sum of the two 
corresponding fields in the columns clinical status CG 
and RA of the respective confusion matrix 
OveraU specificity (RA + OA) = TN/(TN + FP OA + FP RA ); 
all values derived from the column clinical status CG in 
the respective confusion matrix 
Accuracy = (TN + TP OA + TP RA )ln 

The sensitivities were calculated on the basis of the 
numbers from the corresponding columns of the 
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Figure 3 Heatmaps and confusion matrix for the study group 'Berlin 7 . Data were obtained using the study group 'Berlin' (10 control 
group (CG), 10 osteoarthritis (OA), 10 rheumatoid arthritis (RA)) as the training set for the rule generation and re-applying the respective 
rules to the same dataset. (A) Heatmap of the membership weights applying all rules of the primary rule set (a = 0.94; 'CG', 221 rules; 
'OA', four rules; 'RA', 29 rules) for the prediction of the clinical status of the different samples; dashed red lines indicate the lower limits 
of the respective pruned lists of rules subsequently applied in (B). (B) Heatmap of the membership weights applying pruned lists of rules 
(a = 0.94; 'CG', eight top-ranked rules; 'OA', four top-ranked rules; 'RA', eight top-ranked rules) for optimized prediction of the clinical status 
of the different samples. (C) Confusion matrix for the rule set displayed in heatmap (B). TP, true positives; TN, true negatives; FP, false 
positives; FN, false negatives. 



confusion matrix (see above). FN RA represents the 
number of classifications as 'CG' if the ('true') clinical 
state was RA, and FN OA the number of classifications 
as 'CG' if the ('true') clinical state was OA. For the 
study group 'Leipzig', which contains no control 
group ('CG'), FPra represents the misclassifications as 
'RA, if the ('true') clinical status was OA, and FP OA 
represents the misclassifications as 'OA, if the ('true') 
clinical status was RA. 



Identification of biologically relevant molecules 

Functional relations between the genes selected by the rule- 
based approach (total of 57) were screened using Pathway 
Studio (P9, version from 18 February 2013) following iden- 
tification of synonyms in GeneCard (Weizmann Institute of 
Science, Rehovot, Israel. In addition, gene enrichment ana- 
lysis was performed using the tool DAVID [47] to identify 
overrepresented GO-terms or KEEG pathways for the clin- 
ical states 'CG,' 'OA' or 'RA in the dataset 'Total'. 
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Figure 4 Heatmaps and confusion matrix for the study group 'Leipzig 7 . Data were obtained using the study group 'Leipzig' (0 control group 
(CG), six osteoarthritis (OA), 10 rheumatoid arthritis (RA)) as the training set for the rule generation and re-applying the respective rules to the 
same dataset. (A) Heatmap of the membership weights applying all rules of the primary rule set (a = 0.85; 'CG', zero rules; 'OA', 72 rules; 'RA', four 
rules) for the prediction of the clinical status of the different samples; dashed red lines indicate the lower limits of the respective pruned lists of 
rules subsequently applied in (B). (B) Heatmap of the membership weights applying pruned lists of rules (a = 0.85; 'CG', zero top-ranked rules; 
'OA', four top-ranked rules; 'RA', four top-ranked rules) for optimized prediction of the clinical status of the different samples. (C) Confusion matrix 
for the rule set displayed in heatmap (B). TP, true positives; TN, true negatives; FP, false positives; FN, false negatives. 



Results 

In the first step, classifiers that discriminated between 
'RA' patients, 'OA' patients, and healthy controls ('CG') 
were separately trained for each of the study groups and 
were subsequently applied (tested) for the other study 
groups not initially used for training. 

Training of the classifiers 

The significance level alpha s were set to the default 
value of 0.95 for 'Jena_air [ n = 33 patients/samples) and 
for 'Total' (n = 79). For the other study groups, alpha s 
was reduced to 0.94 for 'Jena (n = 33) and 'Berlin (n = 
30) and to 0.85 for 'Leipzig' (n = 16), as described in 
Materials and methods. alpha s thus depended on both 
the sample size n and number m of considered probe 
sets (see below). 



alpha ^random* for which at least one rule was randomly 
generated for each of the three conclusions, was between 
0.01 and 0.10 smaller than the alpha s used for gener- 
ation of the primary rule sets (see Additional file 2 and 
Materials and methods for details). 

The training results obtained for the study group 'Jena_air 
are shown in Figure 1. After primary rule generation, 45, 
seven, and 27 rules were obtained for the clinical states 
'CG', 'OA, and 'RA, respectively (that is, the numbers r max 
('CG,' 'Jena_air), r m ^('OA, 'Jena_air), and r m ^('RA, 
'Jena_air)). The corresponding rule sets are listed in 
Additional file 3. For each rule (r= 1, r max {C> 'Jena_air)) 
and each sample (total of 33 patients; 10 CG, 10 OA, and 13 
RA), the membership weight (W r ; calculated by the fuzzy 
membership degree) is displayed as a heat map in 
Figure 1A. After pruning, seven rules were selected for each 
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Figure 5 Heatmaps and confusion matrix for the study group 'Total'. Data were obtained using the study group Total' (pooled data from 
the three centers; 20 control group (CG), 26 osteoarthritis (OA), 33 rheumatoid arthritis (RA)) as the training set for the rule generation and 
re-applying the respective rules to the same dataset. (A) Heatmap of the membership weights applying all rules of the primary rule set (a = 0.95; 
'CG', 281 rules; 'OA', 25 rules; 'RA', 108 rules) for the prediction of the clinical status of the different samples; dashed red lines indicate the lower 
limits of the respective pruned lists of rules subsequently applied in (B). (B) Heatmap of the membership weights applying pruned lists of rules 
(a = 0.95; 'CG', 21 top-ranked rules; 'OA', nine top-ranked rules; 'RA', 15 top-ranked rules) for optimized prediction of the clinical status of the 
different samples. (C) Confusion matrix for the rule set displayed in heatmap (B). TP, true positives; TN, true negatives; FP, false positives; FN, 
false negatives. 



of the conclusions (Figure IB). Figure 1C and Table 3 dis- 
play the confusion matrix and quality parameters of the 
training results. Except for the sensitivity for OA (90%) and 
the accuracy (97%), all quality parameters reached 100%. 

The following results are restricted to probe sets that 
were qualified by a present call' for all samples of the re- 
spective dataset. In the case of the dataset 'Jena, a num- 
ber m of 7,768 probe sets was considered, for 'Berlin' 
5,159 probe sets, for 'Leipzig' 8,539 probe sets, and for 
'Total' 4,982 probe sets. 



Using the reduced dataset for 'Jena', a total of 61 rules 
was generated (31 rules for 'CG', 10 rules for 'OA', and 
20 rules for 'RA') as shown in Figure 2A. This primary 
rule set was pruned to a set of 29 rules, whose perform- 
ance is displayed in Figure 2B,C. The rule set trained 
with the data of the study group 'Jena' and applied to the 
same dataset resulted in zero errors (Figure 2C) and an 
optimization of all quality parameters to 100% (Table 3). 

The same type of analysis (application of 'present 
calls'; rule set training) was performed for the study 
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Table 3 Optimized number of pruned rules (r opt (C, S)) a and assessment of training results 



Study group 5 'Jena_air 'Jena' 'Berlin' 'Leipzig' 'Total' 



Figure 


1 


2 


3 


4 


5 


Number of rules for 'CG' 


7 


9 


8 


0 


21 


Number of rules for 'OA' 


7 


10 


4 


4 


9 


Number of rules for 'RA' 


7 


10 


8 


4 


15 


Sensitivity for RA (%) 


100 


100 


100 


100 


97 


Sensitivity for OA (%) 


90 


100 


100 


100 


100 


Specificity for RA (%) 


100 


100 


100 


100 


100 


Specificity for OA (%) 


100 


100 


100 


100 


96.2 


Overall specificity (RA + OA) (%) 


100 


100 


100 


n.a. 


95 


Accuracy (%) 


97 


100 


100 


100 


97.5 



CG, control group; n.a., not applicable; OA, osteoarthritis; RA, rheumatoid arthritis. a See Additional file 3. 



groups 'Berlin' and 'Leipzig' (Figures 3 and 4; sum- 
mary in Table 3). Again, rule sets trained in and re- 
applied to the same dataset resulted in zero errors 
(Figures 3C and 4C). For the study group 'Leipzig', 
however, the overall specificity could not be estimated 
due to missing data in the control group ('CG'). Rule 
set training in the pooled 79 samples from the study 
groups 'Jena', 'Berlin', and 'Leipzig' (named study 
group 'Total') resulted in the rules displayed in Fig- 
ure 5 and in only two errors (77 truly classified sam- 
ples; Figure 5C). 



Internal validation of pruned rule sets from the three 
clinical centers by leave-one-out cross-validation and 
bootstrapping resulted in acceptable error rates (see 
Additional file 2). 

Testing of the classifiers 

The classifiers separately trained in the study groups 
'Jena', 'Berlin', and "Leipzig' (see Figures 2, 3 and 4) were 
next applied to the respective other study groups not 
used for training (Table 4). The average accuracy was ap- 
proximately 91%, ranging from 80 to 100%. The mean 



Table 4 Assessment of test results 



Training set from study group 


'Jena' 


'Jena' 


'Berlin' 


'Berlin' 


'Leipzig' 


'Leipzig' 


Test set from study group 


'Berlin' 


'Leipzig' 


'Jena' 


'Leipzig' 


'Jena' 


'Berlin' 


Number of rules for 'CG' 


9 


9 


8 


8 


0 


0 


Number of rules for 'OA' 


10 


10 


4 


4 


4 


4 


Number of rules for 'RA' 


10 


10 


8 


8 


4 


4 


Sensitivity for RA (%) 


100 


100 


92.3 


100 


92.3 


90 


Sensitivity for OA (%) 


40 


100 


90 


83.3 


100 


100 


Specificity for RA (%) 


100 


100 


80 


83.3 


100 


100 


Specificity for OA (%) 


100 


100 


91.3 


100 


92.3 


90 


Overall specificity (RA/OA) (%) 


100 


n.a. 


60 


n.a. 


n.a. 


n.a. 


Accuracy (%) 


80 


100 


81.8 


93.8 


95.6 


95 


Test samples 


30 


16 


33 


16 


23 


20 


Hits for CG 


10 


0 


6 


0 


0 


0 


Hits for OA 


4 


6 


9 


5 


10 


10 


Hits for RA 


10 


10 


12 


10 


12 


9 


Hits total 


24 


16 


27 


15 


22 


19 


Errors for CG 


0 


0 


4 


0 


0 


0 


Errors for OA 


6 


0 


1 


1 


0 


0 


Errors for RA 


0 


0 


1 


0 


1 


1 


Errors total 


6 


0 


6 


1 


1 


1 



CG, control group; n.a., not applicable; OA, osteoarthritis; RA, rheumatoid arthritis. 



Woetzel et al. Arthritis Research & Therapy 2014, 16:R84 
http://arthritis-research.eom/content/16/2/R84 



Page 11 of 21 



sensitivity for the prediction of RA was 96%, ranging 
from 90 to 100%; and that for the prediction of 'OA' was 
86%, ranging from 40 to 100%. 

The number of 'Errors' for the prediction of RA was 
generally extremely small; in three cases ('Jena — > 'Berlin', 
'Jena — > 'Leipzig', and 'Berlin' — > 'Leipzig'), no errors were 
detected; in the remaining cases there was only one error 
each (1/13, 1/13, and 1/10, respectively). 

For the remaining two clinical states (that is, 'CG' and 
'OA') more errors were detected. In the case of 'Jena' — > ' 
Berlin', six OA patients were misclassified as 'CG'; 
whereas in the case of 'Berlin' — > 'Jena', three CG samples 
were misclassified as 'RA' and one CG sample as, OA in 
addition to one OA patient being misclassified as 'RA'. 

Molecular interpretation of the obtained rule sets 

The complete overlap of all rules (that is, premises and 
conclusion) resulting from the comparison of all study 
groups before pruning is shown in Additional files 3 and 
4 (please note the cross-table listing of the overlapping 
genes in Table B of the sheet 'Rule Overlap among Data 
Sets' in Additional file 3). 

If, for the purpose of identifying biologically relevant 
classifiers, the overlap analysis is focused on the three 
independent study groups 'Jena', 'Berlin', and 'Leipzig', a 
list of selected potential 'key' players can be extracted 
(Table 5). 

Whereas no overlap between these groups was found 
for rules with the conclusion 'OA', remarkable overlap 
was found for the conclusions 'CG' and 'RA'. 

The rule 'IF NFIL3 is highly expressed THEN CG' (with 
NFIL3 coding for the nuclear factor interleukin-3-regulated 



protein) was generated with high relevance from both the 
'Jena' and the 'Berlin' datasets (ranked in third and fourth 
position, respectively; Table 5). In addition, the two genes 
MAT2A (methionine adenosyltransferase 2A) and TIPARP 
(2,3,7,8-tetrachlorodibenvzo-p-dioxin (TCDD) -inducible poly 
(ADP-ribose) polymerase) were identified in prominent rules 
for 'CG', each only present in the pruned rule set of one 
study group. 

For the conclusion 'RA', the rules concerning the 'high' 
expression of the genes STAT1, GBP1, PLCG2, CSF2RB, 
and STK10 were highly ranked in pruned rule sets from 
different study groups. STAT1 (signal transducer and ac- 
tivator of transcription 1) was found in the pruned rule 
set 'Berlin' (rank 1), and GBP1 (interferon-inducible gua- 
nylate binding protein 1) in the pruned rule sets 'Jena' 
(rank 2) and 'Berlin' (ranks 2 and 8). PLCG2 (phospho- 
lipase c-gamma-2) was found in the pruned rule set 
'Berlin' (rank 5), and STK10 (serine/threonine kinase 10) 
in the pruned rule set 'Jena' (rank 5). 

Strikingly, the relevance of the rule 'IF CSF2RB is highly 
expressed THEN RA' (CSF2RB coding for the interleukin 
3 receptor/granulocyte-macrophage colony stimulating 
factor 3 receptor, beta was supported by three different 
features: the rule was independently detected in the rule 
sets derived from all three centers ('Jena! 'Berlin', and 
'Leipzig'); the rule occupied the highest rank (rank 1) in 
the rule set from 'Leipzig'; and its complementary rule 'IF 
CSF2RB is low THEN OA' was also detected in the rule 
set 'Leipzig' with rank 3 (see Additional file 3). 

To address a potential pathogenetic role of the genes 
indicated in Table 5, their expression was compared 
among the three different clinical states (both 



Table 5 Overlap between the three independent study groups 


Gene 


Probe set name 


Expression 


'Jena' 


'Berlin' 


'Leipzig' 


'Total' 


symbol 




level 


rule rank 


rule rank 


rule rank 


rule rank 


'CG' 














NFIL3 


203574_at 


High 


3 


4 




1 


JUND 


203752_s_at 


High 


11 


85 




18 


MAT2A 


200768_s_at 


High 


2 


83 




5 


TIPARP 


212665_at 


High 


12 


8 






LEPROTL1 


202594_at 


Low 


27 


127 




113 


'RA' 














STAT1 


M97935_3_at [200887_s_at] 


High 


19 


1 (& 10) 




2 (& 10) 


GBP1 


202270_at [202269_x_at] 


High 


2 


2 (& 8) 




5 (& 6) 


PSMB9 


204279_at 


High 


13 


17 




1 


PLCG2 


20461 3_at 


High 


14 


5 




4 


LY75 


205668_at 


High 


12 


26 




8 


CSF2RB 


205159_at 


High 


17 


28 


1 


3 


STK10 


40420_at 


High 


5 


21 




12 



The genes belonging to at least one of the pruned rule sets of the three independent study groups are highlighted in bold, genes/rules detected by two different 
probe sets are indicated by numbers in parentheses. 
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individually for the three different clinical centers and 
for the pooled study group 'Total' derived from all cen- 
ters). In support of their relevance, all genes/rules char- 
acterizing 'CG' were significantly overexpressed in CG as 
compared with both RA and OA (Additional file 5) - 
with the exception of the gene/rule LEPROTL1 (leptin 
receptor overlapping transcript 1), which also showed 
significant differences, but with an opposite orientation 
(all P < 0.05; Mann Whitney U test). 

Strikingly, all genes/rules identified for RA also ap- 
peared highly discriminative, as shown by a significant 
overexpression in RA in comparison with both CG and 
OA (P values between 10~ n and 0.05 for 41/42 compari- 
sons; P = 0.056 for the remaining comparison; Additional 
file 5). 

In addition to the analysis of the overlapping rules, all 
57 rules generated from the different study groups after 
pruning - that is, 29 rules trained from the dataset 'Jena, 
20 from 'Berlin', and eight from 'Leipzig' (highlighted in 
the complete rule set in Additional file 3) - were screened 
for functional relations using Pathway Studio following 
identification of synonyms in GeneCard. 

Since for three Affymetrix probe sets no gene names 
were identified (see Additional file 6), only 54 genes 
were analyzed using Pathway Studio. The results of the 
Pathway Studio search for the conclusions 'CG' and 'RA' 
are shown in Additional files 7 and 8, respectively. 

Again, no relations were found for the conclusion 'OA'. 
For 'RA', instead, three relations were found (Table 6). In 
addition to the well-known relation JAK2 — > STAT1, 
which regards various cell types including fibroblasts (total 
of 70 references named by Pathway Studio), the relation 
STAT1 -> GBP1 [48-50] and the relation JAK2 -> CSF2RB 
[51-53] have only been addressed by a limited number of 
publications. 

Please note that JAK2 is not contained in Table 5 since 
it was only detected in the rule set for 'RA' in the study 
group 'Jena' (rank 3). 

Gene enrichment analysis for molecular interpretation 
of the obtained rule sets resulted in additional information. 
In CG, for example, there was low expression of genes in- 
volved in MHC class II antigen processing/presentation 

Table 6 Interactions between the premises/genes of the 
pruned rule sets generated from the 'Jena', 'Berlin', and 
'Leipzig' data sets (total of 57 rules), as found by 



Pathway Studio and exemplified for the conclusion 'RA' 



Relation 


Type 


Cell type 


Number of 
references 


JAK2 -> STAT1 


Promoter binding 


Various 


(70) 


JAK2 -> CSF2RB 


Regulation 


Hematopoietic cells 


(3) 


STAT1 -^GBP1 


Protein modification 


Fibroblasts 


(3) 



For more details, see Additional file 8. Pathway Studio from Elsevier, Munich, 
Germany. 



(Additional file 9, sheets 'CG Low BP' and 'CG Low 
KEGG'). In RA, in contrast, there was high expression of 
genes involved in immune response in general and 
leukocyte/T-cell/B-cell activation (Additional file 10, sheets 
'RA High BP' and 'RA High KEGG'), as well as pro- 
grammed cell death (Additional file 10, sheets 'RA High 
BP7RA High KEGG! and 'RA Low BP'). 

As already observed for the sensitivity and accuracy, as 
well as the rule overlap and molecular interpretation, 
OA patients were again more difficult to discriminate, as 
indicated by the almost complete absence of indicative 
GO terms or KEGG pathways in gene enrichment ana- 
lysis (Additional file 11). 

Discussion 

In the present study, three multicenter, genome-wide 
transcriptomic datasets from a total of 79 individuals 
were used to infer rule-based classifiers to discriminate 
RA, OA, and healthy controls. In all cases, the rule sets 
were inferred separately from one of three centers and 
applied to the other centers for validation. This novel 
approach resulted in a high performance (close to 90% 
for specificity, sensitivity, and accuracy) for the discrim- 
ination of RA. Unbiased analysis of the biological rele- 
vance of the underlying rules by Pathway Studio resulted 
in the identification of pathways with known pathogen- 
etic or therapeutic relevance in RA. In addition, serine/ 
threonine kinase 10 (lymphocyte-oriented kinase) was 
identified as a novel molecule with a potential role in 
RA. Yet another novel contribution of the present study 
is the identification of molecules that identify normal 
synovial tissue, an aspect barely addressed to date. 

New approach for the identification of discriminating 
genes and/or rules 

A novel rule-based approach was used to identify genes 
(in combination with their expression status) suitable for 
the discrimination of the clinical states healthy controls 
('CG'), 'OA, and 'RA. This approach has the major 
advantage of skipping the identification of differentially 
expressed genes on the basis of fold changes and/or 
£-test or £/-test analysis, a process highly sensitive to het- 
erogeneity in the patient data and therefore often incap- 
able of identifying relevant disease-specific genes. 

The rule-based approach applied in the present study 
is based on the relevance index of Krone and Kiendl 
[40]; this relevance index has so far only been used for 
rule generation in electrical control engineering [41] or 
biotechnology [38]. In addition, there are only few exam- 
ples for the application of this relevance index to omics 
data (for example [54]) and, to our knowledge, none for 
the application to data in the rheumatology field. 

Rule set pruning, applied in order to minimize the 
numbers of both rules and 'Errors', was successfully used 
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to avoid overfitting and informative imbalance [55]. 
From our experience with heuristic rules, at least four 
rules per conclusion were required [38,55]. 

Quality parameters of the training results 

For the datasets 'Jena , 'Berlin', and 'Leipzig', the values 
for disease-oriented sensitivity and specificity, overall 
specificity, and accuracy were all 100%. This high per- 
formance for the training of the classifiers was expected, 
but still shows that this approach is suitable for the ana- 
lysis of gene expression data from synovial tissue. 

Interestingly, the disease-specific sensitivity for OA in 
the 'Jena_all' dataset was only 90%, resulting in an accur- 
acy of 97% (see Table 3), whereas the quality parameters 
in the 'Jena' dataset all reached 100%. This is probably 
due to the highly stringent approach of only using probe 
sets with a 'present call' in all samples, deliberately 
chosen to minimize false positives. This approach is fur- 
ther supported by reduced error rates in the internal val- 
idation of the "Jena dataset in comparison with the 
'Jena_all' dataset (see Additional file 2). 

The results for the quality parameters in the largest 
possible dataset 'Total', containing 19 CG, 26 OA, and 
32 RA, also proved highly satisfactory; that is, >95%. 
This further underlines the suitability of the relevance 
index approach for large-scale clinical studies with high 
numbers of RA and OA patients [27,30]. 

Quality parameters of the test results 

The quality parameters of the test results for the predic- 
tion of RA were also highly satisfactory; that is, they 
showed a mean close to or higher than 90% for all assess- 
ment parameters (see Table 4). This shows that the real 
challenge of the present study - that is, the prediction of 
RA in test datasets independent of the training dataset - 
can be met with a high accuracy and may indeed contrib- 
ute to the identification of biomarkers for RA. 

Notably, the mean sensitivity and specificity for the 
prediction for OA was considerably lower than for RA, 
due to both misclassification of OA as 'CG' (six cases) or 
as 'RA' (two cases). This is consistent with the clinical 
problem of properly differentiating burnt-out, possibly 
more heterogeneous, OA with low inflammatory activity 
from normal controls on one hand, and active, highly in- 
flammatory OA from RA on the other [1,2]. 

Molecular interpretation of the obtained rule sets 

The number of studies aimed at identifying disease- 
specific signatures in rheumatology with microarray- 
based methods is limited [30,31,35,56-60]. Also, very few 
datasets addressing this question are publicly available 
and have been repeatedly used for bioinformatic ana- 
lyses. In addition, with one exception [57], these studies 
have not analyzed matched multicenter datasets for 



rheumatic diseases. Finally, studies have resulted in the 
identification of numerous and heterogeneous biomarker 
genes or pathways with only limited overlap among the 
results of the different studies. 

In the present study, in contrast, several rules were 
identified in more than one rule set generated in the 
three clinical centers; that is, five rules for the prediction 
of healthy controls (CG) and seven rules for the predic- 
tion of RA (see Table 5). Notably, a total of seven of 
these rules were represented not only in the primary rule 
set of the centers, but also in one or more of the re- 
spective pruned rule sets. Strikingly, no overlapping 
rules were observed for 'OA', again underlining the 
problem of properly differentiating OA from either CG 
or RA (see above for the Quality parameters of the test 
results). 

In addition, automated analysis of interactions by Path- 
way Studio between the molecules identified in the union 
of all optimized rule sets (total of 57 rules; derived from 
three clinical centers with either two or three disease 
states) resulted in three interactions supported by at least 
three references; that is, JAK2 — > STAT1 (70 references), 
STAT1 -> GBP1 (three references) and JAK2 -> CSF2RB 
(three references; see Table 6). Please note that JAK2 was 
only detected once at rank 3 in the 'Jena' rule set (see 
Additional file 3) and is therefore not listed in Table 5. 

Rules for the prediction of healthy controls (CG) 

The genes identified above as overexpressed in CG may 
represent a core set of markers of healthy tissue and re- 
flect regulatory genes specifically involved in the down- 
regulation/prevention of rheumatic diseases (that is, OA 
or RA). 

Nuclear factor interleukin-3-regulated protein 

NFIL3 is a basic leucine transcription factor acting as a 
regulator of genes associated with acquired and innate im- 
munity (for example, interleukin (IL)-3 and interferon- 
gamma (IFNy) [61]) or with the inhibition of proliferation 
and senescence [62]. In particular, NFIL-3 negatively regu- 
lates IL-12 p40 in macrophages and dendritic cells [63,64] 
and suppresses TH2 cytokines [65], as well as the develop- 
ment and maturation of NK cells [66]. In addition, NFIL3 
exhibits anti-apoptotic features [67]. In particular, the role 
of NFIL3 in limiting the production of proinflammatory 
IL-12 may explain its upregulation in the normal CG. On 
the other hand, its prominent influence on essential cellular 
features (for example, metabolism, growth, viability) points 
to a potential contribution to the pathogenesis of RA (and/ 
or OA) in the case of dysregulated underexpression. 

Jun D proto-oncogene 

Members of the JUN and FOS families are known as 
immediate-early response proto-oncogenes, since they 
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are rapidly induced by various activating agents and, on 
the other hand, have a very short half-life (in the range 
of minutes for both mRNA and protein) [68]. As in the 
case of NFIL3, the transcription factor JunD also regu- 
lates genes involved in acquired and innate immunity 
[69], in proliferation and senescence [70], or in anti- 
apoptotic effects [71,72]. 

Individual JUN/FOS family molecules show different 
biological activities. Whereas C-JUN and C-FOS are 
known as activating proto-oncogenes with transforming 
activity [73,74], JUND also shows de-activating features 
[68,73,75-77]. The effects of AP-1 complexes composed 
of different JUN/FOS family members clearly depend on 
the local promoter context of genes driven by AP-1 (for 
example, MMP-1 [78,79]). JUND suppresses synovial fibro- 
blast proliferation and even antagonizes Ras-mediated 
transformation of the fibroblasts [77], and thus its overex- 
pression may exert a protective role in the synovial mem- 
brane of normal joints. 

Methionine adenosyltransferase 2A 

The importance of the overexpression of MAT2A in CG 
samples is presently unclear. This molecule is involved in 
the regulation of basic cellular functions, such as the syn- 
thesis of polyamines (thought to play a role in nucleic acid 
and protein synthesis) and developmental processes [80]. 

2,3,7,8-tetrachlorodibenzo-p-dioxin-inducible 
poly(ADP-ribose) polymerase (TIPARP) 

Poly(ADP-ribosyl)ation physiologically contributes to the 
survival of damaged proliferating cells by immediate, 
DNA damage-dependent post-translational modification 
of histones and other proteins in the nucleus. By this 
process, poly(ADP-ribose) polymerases are involved in cel- 
lular functions such as proliferation and cell death. It is to 
be expected that the growing poly(ADP-ribose) polymer- 
ase superfamily may become the target of pharmacological 
strategies enhancing both antitumor efficacy and the treat- 
ment of a number of inflammatory and neurodegenerative 
disorders [81]. 

TiPARP (PARP-7) was originally identified by differen- 
tial display as a TCDD-induced mRNA [82]. Although 
the exact function of TiPARP is presently unclear, its 
effects on T-cell function and its possible contribution 
to tumor promotion suggest a role also in the normal or 
arthritic synovial membrane [81]. 

Leptin receptor overlapping transcript-like 1 

The leptin receptor overlapping transcript (also called 
OB-RGRP [83]) is one of the three members of a gene 
family [84,85]. Leptin receptor overlapping transcript 
molecules are small proteins of 131 to 140 amino acids, 
carrying four potential transmembrane domains. 



LEPROTL1, a gene widely expressed in human tissues, 
including metabolic tissues such as muscle and liver 
[83,84,86], has an influence on growth, plasma insulin- 
like growth factor- 1 levels, hepatic sensitivity to growth 
hormone, and cell-surface growth hormone or leptin re- 
ceptor expression and leptin signaling [87,88]. 

The high importance of LEPROTLlin protein trafficking 
to the vacuole/lysosome of eukaryotic cells, a process ini- 
tially regarded as pathogenetically relevant in RA [89-91], 
and in the downregulation of membrane protein levels 
suggests a phylogenetically conserved role for LEPROTL1 
[85]. Since LEPROTL1 does not appear to act as a classic 
leptin receptor, its role in the physiology and pathophysi- 
ology of the synovial membrane is presently uncertain. 

In the present dataset, the above-mentioned NFIL3, 
JUND, MAT2A, and TIPARP were indeed significantly 
overexpressed in the synovial membrane of CG as com- 
pared with both RA and OA (both individually for the 
three different clinical centers and for the pooled study 
group 'Total' derived from all centers; Additional file 5). 
Interestingly, overexpression of JUND (OA vs. RA) has 
not only been observed in synovial membranes, but also 
in proinflammatory synovial fibroblasts isolated from 
synovial tissue [92]. 

In contrast, LEPROTL1 was the only gene significantly 
underexpressed in the synovial membrane of CG as 
compared with both RA and OA, suggesting that this 
molecule may support inflammatory and/or degenerative 
joint diseases. Similarly to JUND, however, in an oppos- 
ite direction, differential expression of LEPROTL1 was 
not only observed in synovial membranes, but also in 
resident synovial fibroblasts [92]. 

Rules for the prediction of rheumatoid arthritis 

The genes overexpressed in RA synovial tissue (see 
Table 5) may represent biomarkers of RA and reflect 
processes specifically involved in the pathogenesis and/ 
or progression of the disease. A disease specificity of the 
markers is strongly supported by their significant over- 
expression in RA, not only in comparison with CG but 
also with the 'disease' control OA (see Additional file 5). 
In the RA groups, genes especially associated with the 
regulation of immunologic processes appear to be suit- 
able as disease-specific identifiers. 

Signal transducer and activator of transcription 1 

STAT1, a transcription factor regulating (amongst others) 
immunity-mediating genes, is known to be upregulated in 
RA patients [59,93]. In addition to other transcription fac- 
tors (for example, NFKB or AP-1), STAT1 has long been 
regarded as a pivotal transcription factor involved in joint 
inflammation and destruction [60,94]. The identification 
of these key factors underlines the robustness of the 
present approach. This is further underlined by the fact 
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that the rule 'STAT1 high in RA appears a total of five 
times in three different rule sets (rule set 'Jena, position 
19; rule set 'Berlin, positions 1 and 10; rule set 'Total', 
positions 2 and 10; see Table 5 and Additional file 4 for 
details and the corresponding Affymetrix probe sets). 

In addition, there was a reciprocal detection of the 
complementary rule IF STAT1 is low THEN OA in the 
rule set 'Leipzig' with rank 12 (see Additional file 3). 

Interferon-inducible guanylate binding protein 1 

GBP1, a protein specifically binding guanylated nucleo- 
tides with potential effects on GTPases involved in signal 
transduction, has been already described as upregulated 
in RA versus OA synovial tissue [95] . Also, this factor is 
implicated in the pathogenesis of RA due to its upregu- 
lation by IFNy [95,96]. As in the case of STAT1, this 
finding confirms that key mediators of rheumatoid in- 
flammation have been identified in the present study. 
This is again further underlined by the fact that the rule 
'GBP1 high in RA appears a total of five times in three 
different rule sets (rule set 'Jena, position 2; rule set 
'Berlin, positions 2 and 8; rule set 'Total', positions 5 
and 6; see Table 5 and Additional file 4). 

Proteasome (prosome, macropain) subunit, beta type, 9 
(large multifunctional peptidase 2/low molecular mass 
protein 2) 

The proteasomal subunit PSMB9 (also known as LMP2; 
see abbreviations) is involved in the regulation of proteo- 
lytic specificity, especially in response to IFN-y, thus en- 
abling the formation of immunoproteasomes and the 
generation of peptides presentable by MHC I molecules 
[97]. PSMB9 also enhances cytokine production (for ex- 
ample, tumor necrosis factor, IL-lp, IFNy [98]). Indeed, 
this molecule shows a significant genetic association 
with RA in ethnic Han Chinese from Yunan [99] and is 
the target of autoimmune reactions in RA [100]. As for 
STAT1 and GBP1, the validity of the rule 'PSMB9 high 
in RA is emphasized by the fact that it appears in three 
different rule sets (rule set 'Berlin', position 13; rule set 
'Leipzig', position 17; rule set 'Total', position 1; see 
Table 5 and Additional file 4). 

Phospholipase C-gamma-2 

The function of members of the phospholipase C family is 
the hydrolysis of phospholipids into fatty acids and other 
lipophilic molecules. The family members are grouped into 
several subtypes and catalyze the hydrolysis of phos- 
phatidylinositol 4,5-bisphosphate to inositol 1,4,5-trisphos- 
phate and 1,2-diacylglycerol, which both have important 
second messenger functions. Phospholipase C-gamma is 
activated by phosphorylation in response to various growth 
factors or immune signals, is broadly expressed, and 
carries diverse biological functions in inflammation, cell 



growth, signaling/death, and maintenance of membrane 
phospholipids. Activating mutations in the PLCG2 gene 
have been shown to induce autoimmunity, inflammation, 
and/or inflammatory arthritis in murine models [101,102]. 
PLCG2 has already been recognized as an excellent dis- 
criminator of RA against other types of arthritis or auto- 
immune diseases [103] and appears to be significantly 
upregulated in RA synovial tissue as compared with the 
normal synovial membrane [104]. As for STAT1, GBP1, 
and PSMB9/LMP2, the validity of the rule 'PLCG2 high in 
RA' was emphasized by its appearance in three independ- 
ently established rule sets (rule set 'Berlin', position 5; rule 
set 'Jena, position 14; rule set 'Total', position 4; see Table 5 
and Additional file 4). 

Lymphocyte antigen 75 

Ly75, a member of the human macrophage mannose recep- 
tor family (also known as DEC205 or GP200-MR6), sup- 
ports antigen presentation of dendritic cells [105] and 
mediates anti-proliferative as well as promaturational ef- 
fects in B lymphocytes [106]. This molecule is apparently 
upregulated in monocytes derived from RA patients in 
comparison with those from normal donors [107], but its 
role in disease is currently unknown. Interestingly, however, 
single nucleotide polymorphisms of the Ly75 antigen be- 
long to the three single nucleotide polymorphisms most 
significantly associated with type 2 diabetes mellitus, leaving 
open a possible role of Ly75 in inflammatory disease [108]. 

CSF2RB (interleukin 3 receptor/granulocyte macrophage 
colony stimulating factor 3 receptor, beta) 

A most striking finding in the present study is the rule 
'CSF2RB high in RA. CSF2RB codes for a transmem- 
brane protein and acts as a common receptor subunit 
(also known as common beta chain) for granulocyte- 
macrophage colony-stimulating factor (GM-CSF), IL-5, 
and IL-3, which play a preeminent role in inflammation 
and hematopoiesis [109,110]. One of the ligands of 
CSF2RB (that is, GM-CSF) has long been implicated in 
the pathogenesis of RA, and other rheumatic or auto- 
immune diseases [60,111-119]. This has recently led to 
the development of neutralizing therapeutic monoclonal 
antibodies specifically directed against the a-chain of the 
GM-CSF receptor, which have been successfully used for 
the treatment of RA [120-122]. 

Notably, the rule 'CSF2RB high in RA appeared in the 
independently established rule sets of all analyzed cohorts 
(rule set 'Jena', position 17; rule set 'Berlin', position 28; 
rule set 'Leipzig', position 1; and, remarkably, rule set 
'Total', position 3), again underling the validity of the 
completely unbiased procedure of rule set generation. As 
in the case of STAT1, there was a reciprocal detection of 
the complementary rule 'IF CSF2RB is low THEN OA in 
the rule set 'Leipzig' with rank 3 (Additional file 3). 
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Serine/threonine kinase 10 (lymphocyte-oriented kinase) 

STK10 is a member of the Ste20 family of serine/threonine 
protein kinases with similarity to several known polo-like 
kinase kinases [123], which associates with and phosphory- 
lates polo-like kinase 1 and whose functional inactivation 
interferes with normal cell cycle progression. STK10 also 
negatively regulates IL-2 expression in T cells via the 
mitogen-activated protein kinase kinase 1 pathway [124]. 
Interestingly (and potentially relevant for RA), STK10 is 
involved in the regulation of cytoskeletal rearrangement 
through phosphorylation of the ezrin-radixin-moesin 
proteins [125], a process also strongly emphasized by a 
previous report [96] and by a relatively low expression of 
the respective genes in the gene enrichment analysis in 
the 'CG' group (see Additional file 9; sheet 'CG low BP'). 
In addition, STK10 potentiates dexamethasone-induced 
apoptosis [126] and may thus contribute to the dysregula- 
tion of apoptosis possibly involved in RA [127]. Finally, 
STK10 may play a role in autoimmune skin diseases [128], 
although a direct involvement of this molecule in arthritis 
has never been reported. 

As in the case of rules for healthy control (CG), all genes 
used for the prediction of RA were indeed significantly 
overexpressed in the synovial membrane of RA as com- 
pared with both OA and CG (both individually for the 
three different clinical centers and for the pooled study 
group 'Total' derived from all centers; see Additional file 5). 
Interestingly, highly significant overexpression of CSF2RB 
(RA vs. OA; P = 5.4 x 10" 6 ) was not only observed in syn- 
ovial membranes, but also in proinflammatory synovial fi- 
broblasts isolated from synovial tissue [92]. 

Finally, in combination with JAK2, one of the most in- 
fluential rules in the 'Jena RA group (position 3; high in 
RA), a subset of the genes (STAT1, GBP1, CSF2RB) can 
be combined in a JAK/ STAT- dependent gene regulatory 
network [59,60,129-131]. This also indicates that the 
rules identifying RA patients in the present study are 
not generated randomly, but reflect a mechanistic rele- 
vance within the context of RA pathogenesis. Concern- 
ing JAK2, its concrete relevance in RA is stressed by the 
development of therapeutic approaches directed at the 
JAK pathway [129]. 

Overall, the present study confirmed the involvement 
of partially or well-known molecules/pathways in RA 
(for example, STAT1, GBP1, PLCG2, CSF2RB), but also 
identified molecules previously not associated with RA 
(for example, STK10). Also, to our knowledge, there are 
at present no reports on molecules /pathways positively 
identifying the clinical status 'CG' in general, and the 
NFIL-3 pathway in particular. Finally, the present study 
presents for the first time a unifying hypothesis' by ad- 
dressing the overlap of the highly ranked rules/genes 
among different clinical centers and thus pinning down 
molecules of universal relevance in heterogeneous 



patient cohorts from different centers. This is also sup- 
ported by the representation of the top 12 rules of the 
'Total' dataset in the overlap table; that is, the largest in- 
dependently analyzed cohort in the present study (total 
of 79 donors (patients). 

Conclusions 

In this study, three multicenter, genome-wide transcrip- 
tomic datasets were applied to infer rule-based clas- 
sifiers/genes to discriminate RA, OA, and healthy 
controls, and were subsequently analyzed for their bio- 
logical relevance using Pathway Studio and gene enrich- 
ment analysis. This novel approach resulted in a high 
performance for the discrimination of RA and the iden- 
tification of factors with known pathogenetic or thera- 
peutic relevance in RA (for example, STAT1, GBP1, 
IFNy, GM-CSF, and its receptor CSF2RB, as well as 
JAK2, the latter pointing to a JAK/STAT-dependent 
gene regulatory network). This indicates that the rules 
identifying RA patients were not generated randomly, 
but reflect (disease-specific) key biomarkers with mech- 
anistic relevance for RA pathogenesis and progression, 
some of them well established and already exploited for 
therapeutic purposes. 

The present study contributes to focusing the diagnos- 
tic and therapeutic interest in RA on relevant and in- 
novative molecules or pathways; for example, GM-CSF 
and its receptor CSF2RB. The fact that such known 
pathways have been identified in the present study for 
the prediction of RA suggests a high sensitivity and val- 
idity of the current approach. In addition, the present 
study for the first time addressed a multicenter cross- 
validation and may thus contribute to the identification 
of molecules with universal relevance in heterogeneous 
RA patient cohorts, possibly including the previously 
undescribed STK10. 

At a molecular level, the biomarkers were signifi- 
cantly overexpressed in RA synovial tissue (mostly in 
the study groups from all three centers), not only in 
comparison with healthy controls, but also with the 
'disease' control OA. In addition, significant overexpres- 
sion was not limited to the synovial tissue as a whole, 
but also occurred in isolated synovial fibroblasts, a cell 
population regarded as highly important for chronic in- 
flammatory RA. 

In perspective, validation, refinement, and generalization 
of the present rule-based, discriminative procedure in a 
larger prospective cohort are necessary. The identified bio- 
markers may prove useful for diagnosis or differential 
diagnosis of RA patients (including potential subpopula- 
tions), as well as for stratification and monitoring of 
(responders and nonresponder) patients in routine or ex- 
perimental clinical trials. 
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Additional files 



Additional file 1: Calculation of the relevance index. 

Additional file 2: Internal validation of rule sets. 

Additional file 3: List of the 'complete primary rule sets' for all 
datasets, as well as the 'Rule overlap among data sets'. The data are 
displayed as either 'complete primary rule sets' with the pruned 
(optimized) rules highlighted in bold (Sheet 1) or as the 'Rule Overlap 
among Data Sets' (Sheet 2) with the rules/genes showing an overlap 
between the three independent study groups 'Jena', 'Berlin', and 'Leipzig' 
highlighted in grey. In both cases, the rules were generated as stated in 
Materials and methods ('Rule set generation') and the ranks of the 
individual rules in the respective dataset are indicated. 

Additional file 4: Listing of the overlap among the different rule sets. 

The data are displayed as the 'Rule Overlap among Data Sets' including the 
gene names. The ranks of the individuals rules in the respective dataset are 
indicated and the rules/genes showing an overlap between the three 
independent study groups 'Jena', 'Berlin', and 'Leipzig' are highlighted in grey. 

Additional file 5: Log-fold change and P values for differentially 

expressed genes. Log-fold change (log2 FC) and P values (Mann Whitney 
U test, red: P> 0.05) for the genes differentially expressed among patients 
with a different clinical status (genes significantly overexpressed in RA versus 
both CG and OA are highlighted in grey; see also Table 6). 

Additional file 6: Genes (original symbols) and the synonyms used as 
input for the Pathway Studio 9 search for interactions among the genes. 

Additional file 7: Interactions among the genes in the pruned rule sets 

(CG). Interactions found by Pathway Studio among the genes contained in 
the pruned rule sets of the 'Jena' and 'Berlin' datasets for the conclusion 'CG'. 

Additional file 8: Interactions among the genes in the pruned rule sets 

(RA). Interactions found by Pathway Studio among the genes contained in 
the pruned rule sets of the 'Jena', 'Berlin' and 'Leipzig' datasets for the 
conclusion 'RA'. 

Additional file 9: Gene enrichment analysis for molecular 
interpretation (CG). Gene enrichment analysis for molecular interpretation of 
the obtained rule set for the conclusion 'CG' using the GO terms biological 
process (BP) and molecular function (MF), as well as KEGG pathways. The 
analyses were performed separately for the 'CG' rules showing a high or 
low expression level. Category = type of term (GO term/KEEG pathway); 
Term = denomination of term (interesting terms highlighted in grey); 
Count = list hits; number of genes in the rule set belonging to the term in 
question; p value = EASE score (upper boundary of the distribution of 
Jackknife Fisher exact probabilities given the actual Count, List Total, Pop Hits, 
and Pop Total); Genes = gene symbols of included rules/genes; List 
Total = number of genes in the rule set (for high and low expression, 
respectively); Pop Hits = number of genes in the population background 
belonging to the specific term; Pop Total = number of genes in the 
population background; BH-adjusted p value = Benjamini-Hochberg adjusted 
P value (threshold P < 0.05 indicated by fat frame). 

Additional file 10: Gene enrichment analysis for molecular 
interpretation (RA). Gene enrichment analysis for molecular interpretation of 
the obtained rule set for the conclusion 'RA' using the GO terms biological 
process (BP) and molecular function (MF), as well as KEGG pathways. The 
analyses were performed separately for the 'RA' rules showing a high or low 
expression level. In the case of 'RA' rules showing a low expression level, there 
were only results for the GO terms BP and MF. Category = type of term (GO 
term/KEEG pathway); Term = denomination of term (interesting terms 
highlighted in grey); Count = list hits; number of genes in the rule set 
belonging to the term in question; p value = EASE score (upper boundary of 
the distribution of Jackknife Fisher exact probabilities given the actual Count, 
List Total, Pop Hits, and Pop Total); Genes = gene symbols of included rules/ 
genes; List Total = number of genes in the rule set (for high and low 
expression, respectively); Pop Hits = number of genes in the population 
background belonging to the specific term; Pop Total = number of genes in 
the population background; BH-adjusted p value = Benjamini-Hochberg 
adjusted P value (threshold P < 0.05 indicated by fat frame). 

Additional file 11: Gene enrichment analysis for molecular 
interpretation (OA). Gene enrichment analysis for molecular 



interpretation of the obtained rule set for the conclusion 'OA' using the 
GO terms biological process (BP) and molecular function (MF), as well as 
KEGG pathways. The analyses were performed separately for the 'OA' 
rules showing a high or low expression level. There were only results for 
the GO term MF in 'OA' rules showing a high expression level. Category 
= type of term (GO term/KEEG pathway); Term = denomination of term; 
Count = list hits; number of genes in the rule set belonging to the term 
in question; p value = EASE score (upper boundary of the distribution of 
Jackknife Fisher exact probabilities given the actual Count, List Total, Pop 
Hits, and Pop Total); Genes = gene symbols of included rules/genes; List 
Total = number of genes in the rule set (for high and low expression, 
respectively); Pop Hits = number of genes in the population background 
belonging to the specific term; Pop Total = number of genes in the 
population background; BH-adjusted p value = Benjamini-Hochberg 
adjusted P value. 
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